Sample Complexity of Average-Reward Q-Learning: From Single-agent to Federated Reinforcement Learning

Jiao, Yuchen, Woo, Jiin, Li, Gen, Joshi, Gauri, Chi, Yuejie

arXiv.org Machine Learning

Average-reward reinforcement learning offers a principled framework for long-term decision-making by maximizing the mean reward per time step. Although Q-learning is a widely used model-free algorithm with established sample complexity in discounted and finite-horizon Markov decision processes (MDPs), its theoretical guarantees for average-reward settings remain limited. This work studies a simple but effective Q-learning algorithm for average-reward MDPs with finite state and action spaces under the weakly communicating assumption, covering both single-agent and federated scenarios. For the single-agent case, we show that Q-learning with carefully chosen parameters achieves sample complexity $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|\|h^{\star}\|_{\mathsf{sp}}^3}{\varepsilon^3}\right)$, where $\|h^{\star}\|_{\mathsf{sp}}$ is the span norm of the bias function, improving previous results by at least a factor of $\frac{\|h^{\star}\|_{\mathsf{sp}}^2}{\varepsilon^2}$. In the federated setting with $M$ agents, we prove that collaboration reduces the per-agent sample complexity to $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|\|h^{\star}\|_{\mathsf{sp}}^3}{M\varepsilon^3}\right)$, with only $\widetilde{O}\left(\frac{\|h^{\star}\|_{\mathsf{sp}}}{\varepsilon}\right)$ communication rounds required. These results establish the first federated Q-learning algorithm for average-reward MDPs, with provable efficiency in both sample and communication complexity.
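The paper's exact algorithm and parameter choices are given in the full text; as a rough illustration of the algorithm family it belongs to, here is a minimal tabular sketch of relative-value-iteration-style Q-learning for average-reward MDPs under a generative model. The oracle `sample`, the reference state, the constant step size, and the sweep schedule are all illustrative assumptions, not the authors' settings:

```python
import numpy as np

def avg_reward_q_learning(sample, num_states, num_actions,
                          num_sweeps=20_000, step_size=0.01, ref_state=0):
    """Tabular RVI-style Q-learning sketch for average-reward MDPs.

    `sample(s, a)` is a generative-model oracle returning (reward, next_state).
    Subtracting q[ref_state].max() anchors the iterates, since average-reward
    Bellman solutions are only defined up to an additive constant.
    """
    q = np.zeros((num_states, num_actions))
    for _ in range(num_sweeps):
        for s in range(num_states):
            for a in range(num_actions):
                r, s_next = sample(s, a)
                # Relative update: the gain estimate q[ref_state].max()
                # is subtracted so the iterates stay bounded.
                target = r - q[ref_state].max() + q[s_next].max()
                q[s, a] += step_size * (target - q[s, a])
    policy = q.argmax(axis=1)   # greedy policy
    gain = q[ref_state].max()   # estimate of the optimal average reward
    return policy, gain
```

On a toy MDP where one action always pays reward 1, the gain estimate converges toward 1 and the greedy policy selects that action.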


Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs

Neural Information Processing Systems

We study the sample complexity of learning an $\varepsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model. For weakly communicating MDPs, we establish the complexity bound $\widetilde{O}\left(SA\frac{\mathsf{H}}{\varepsilon^2} \right)$, where $\mathsf{H}$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space. Our result is the first that is minimax optimal (up to log factors) in all parameters $S,A,\mathsf{H}$, and $\varepsilon$, improving on existing work that either assumes uniformly bounded mixing times for all policies or has suboptimal dependence on the parameters. We also initiate the study of sample complexity in general (multichain) average-reward MDPs. Both results are based on reducing the average-reward MDP to a discounted MDP, which requires new ideas in the general setting.
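The complexity parameter $\mathsf{H}$ here is the span seminorm of the optimal bias function, $\max_s h(s) - \min_s h(s)$. A one-line helper (the name `span` is ours) makes the key property concrete: unlike the sup norm, the span is invariant to adding a constant, matching the fact that average-reward bias functions are only identified up to a shift:

```python
import numpy as np

def span(v):
    """Span seminorm ||v||_sp = max(v) - min(v)."""
    v = np.asarray(v, dtype=float)
    return float(v.max() - v.min())
```

For example, `span([2.0, 5.0, 3.0])` and `span([12.0, 15.0, 13.0])` are both 3.0, while the sup norms of the two vectors differ.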


A Maximum-Entropy Approach to Off-Policy Evaluation in Average-Reward MDPs

Neural Information Processing Systems

This work focuses on off-policy evaluation (OPE) with function approximation in infinite-horizon undiscounted Markov decision processes (MDPs) that are ergodic and linear.



Near-Optimal Sample Complexity Bounds for Constrained Average-Reward MDPs

Wei, Yukuan, Li, Xudong, Yang, Lin F.

arXiv.org Machine Learning

Recent advances have significantly improved our understanding of the sample complexity of learning in average-reward Markov decision processes (AMDPs) under the generative model. However, much less is known about the constrained average-reward MDP (CAMDP), where policies must satisfy long-run average constraints. In this work, we address this gap by studying the sample complexity of learning an $\varepsilon$-optimal policy in CAMDPs under a generative model. We propose a model-based algorithm that operates under two settings: (i) relaxed feasibility, which allows small constraint violations, and (ii) strict feasibility, where the output policy satisfies the constraint. We show that our algorithm achieves sample complexities of $\widetilde{O}\left(\frac{SA(B+H)}{\varepsilon^2}\right)$ and $\widetilde{O}\left(\frac{SA(B+H)}{\varepsilon^2 \zeta^2}\right)$ under the relaxed and strict feasibility settings, respectively. Here, $\zeta$ is the Slater constant indicating the size of the feasible region, $H$ is the span bound of the bias function, and $B$ is the transient time bound. Moreover, a matching lower bound of $\widetilde{\Omega}\left(\frac{SA(B+H)}{\varepsilon^2 \zeta^2}\right)$ for the strict feasibility case is established, thus providing the first minimax-optimal bounds for CAMDPs. Our results close the theoretical gap in understanding the complexity of constrained average-reward MDPs.


Faster Fixed-Point Methods for Multichain MDPs

Zurek, Matthew, Chen, Yudong

arXiv.org Machine Learning

We study value-iteration (VI) algorithms for solving general (a.k.a. multichain) Markov decision processes (MDPs) under the average-reward criterion, a fundamental but theoretically challenging setting. Beyond the difficulties inherent to all average-reward problems posed by the lack of contractivity and non-uniqueness of solutions to the Bellman operator, in the multichain setting an optimal policy must solve the navigation subproblem of steering towards the best connected component, in addition to optimizing long-run performance within each component. We develop algorithms which better solve this navigational subproblem in order to achieve faster convergence for multichain MDPs, obtaining improved rates of convergence and sharper measures of complexity relative to prior work. Many key components of our results are of potential independent interest, including novel connections between average-reward and discounted problems, optimal fixed-point methods for discounted VI which extend to general Banach spaces, new sublinear convergence rates for the discounted value error, and refined suboptimality decompositions for multichain MDPs. Overall our results yield faster convergence rates for discounted and average-reward problems and expand the theoretical foundations of VI approaches.
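For the weakly communicating special case, the classical baseline that multichain algorithms build upon is relative value iteration, where convergence is naturally measured in the span seminorm: the average-reward Bellman operator is nonexpansive rather than contractive, and its fixed points are only unique up to a constant. A minimal sketch for a known model given as a transition tensor (this is the textbook method, not the paper's multichain algorithm):

```python
import numpy as np

def relative_value_iteration(P, r, tol=1e-8, max_iters=100_000):
    """Relative value iteration for a known average-reward MDP.

    P: transitions, shape (A, S, S), P[a, s, s'] = Pr(s' | s, a).
    r: rewards, shape (S, A).
    Returns (gain, bias anchored at state 0, greedy policy).
    """
    v = np.zeros(r.shape[0])
    for _ in range(max_iters):
        q = r + (P @ v).T                  # Bellman backup, shape (S, A)
        tv = q.max(axis=1)
        diff = tv - v
        if diff.max() - diff.min() < tol:  # span of Tv - v controls the error
            gain = 0.5 * (diff.max() + diff.min())
            return gain, tv - tv[0], q.argmax(axis=1)
        v = tv - tv[0]                     # anchor at state 0 to keep v bounded
    raise RuntimeError("no convergence within max_iters")
```

Anchoring at a fixed state quotients out the constant shift, which is exactly what makes the span, rather than the sup norm, the right measure of progress.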



Review for NeurIPS paper: A Maximum-Entropy Approach to Off-Policy Evaluation in Average-Reward MDPs

Neural Information Processing Systems

Correctness: The main technical content seems to be correct. I have the following questions, though: When using the linear assumption for the reward and the dynamics, the feature selection/setting is crucial. It is also mentioned that, to relax the linear assumption, features can be pre-trained. What would be the recommended way to pre-learn them? If the assumptions are violated, how would that affect the results in practice?


Review for NeurIPS paper: A Maximum-Entropy Approach to Off-Policy Evaluation in Average-Reward MDPs

Neural Information Processing Systems

This is a borderline paper. The paper is technically sound, and addressing OPE in the average-reward setting is an important problem. Although the work is an extension of Duan and Wang (for the discounted setting) to the average-reward setting, the algorithm is somewhat different: Duan and Wang use FQE, whereas the current paper performs stationary-distribution estimation. That said, there are a few weaknesses that the paper should try to address or at least discuss: 1. The entropy maximization is a novel algorithmic element which does not appear in previous approaches in the discounted setting.